NSF PAR Search | NSF Public Access Repository

Reconfigurable Stream Network Architecture

https://doi.org/10.1145/3695053.3731088

Wang, Chengyue; Zhang, Xiaofan; Cong, Jason; Hoe, James C (June 2025, ACM)

As AI systems grow increasingly specialized and complex, managing hardware heterogeneity becomes a pressing challenge. How can we efficiently coordinate and synchronize heterogeneous hardware resources to achieve high utilization? How can we minimize the friction of transitioning between diverse computation phases, reducing costly stalls from initialization, pipeline setup, or drain? Our insight is that a network abstraction at the ISA level naturally unifies heterogeneous resource orchestration and phase transitions. This paper presents a Reconfigurable Stream Network Architecture (RSN), a novel ISA abstraction designed for the DNN domain. RSN models the datapath as a circuit-switched network with stateful functional units as nodes and data streaming on the edges. Programming a computation corresponds to triggering a path. Software is explicitly exposed to the compute and communication latency of each functional unit, enabling precise control over data movement for optimizations such as compute-communication overlap and layer fusion. As nodes in a network naturally differ, the RSN abstraction can efficiently virtualize heterogeneous hardware resources by separating control from the data plane, enabling low instruction-level intervention. We build a proof-of-concept design RSN-XNN on VCK190, a heterogeneous platform with FPGA fabric and AI engines. Compared to the SOTA solution on this platform, it reduces latency by 6.1x and improves throughput by 2.4x–3.2x. Compared to the T4 GPU with the same FP32 performance, it matches latency with only 18% of the memory bandwidth. Compared to the A100 GPU at the same 7nm process node, it achieves 2.1x higher energy efficiency in FP32.
Free, publicly-accessible full text available June 20, 2026
SSDTrain: An Activation Offloading Framework to SSDs for Faster Large Language Model Training

https://doi.org/10.1109/DAC63849.2025.11132754

Wu, Kun; Park, Jeongmin Brian; Zhang, Xiaofan; Hidayetoğlu, Mert; Mailthody, Vikram Sharma; Huang, Sitao; Lumetta, Steve; Hwu, Wen-Mei (June 2025, IEEE)

GPU memory capacity has not kept pace with the size of large language models (LLMs), hindering the model training process. In particular, activations (the intermediate tensors produced during forward propagation and reused in backward propagation) dominate GPU memory use. This leads to high training overheads such as expensive weight update costs due to the small micro-batch size. To address this challenge, we propose SSDTrain, an adaptive framework that offloads activations to high-capacity NVMe SSDs. SSDTrain reduces GPU memory usage without impacting performance by fully overlapping data transfers with computation. SSDTrain is compatible with popular deep learning frameworks like PyTorch, Megatron, and DeepSpeed, and it employs techniques such as tensor deduplication and forwarding to further enhance efficiency. We extensively experimented with popular LLMs like GPT, BERT, and T5. Results demonstrate that SSDTrain reduces activation peak memory usage by 47%. At the same time, SSDTrain perfectly overlaps the I/O with the computation and incurs negligible overhead. Compared with keeping activations in GPU memory and layerwise full recomputation, SSDTrain achieves the best memory savings with negligible throughput loss. We further analyze how the reduced activation memory use may be leveraged to increase throughput by increasing micro-batch size and reducing pipeline parallelism bubbles.
Free, publicly-accessible full text available June 22, 2026
AutoAI2C: An Automated Hardware Generator for DNN Acceleration on Both FPGA and ASIC

https://doi.org/10.1109/TCAD.2024.3393428

Zhang, Yongan; Zhang, Xiaofan; Xu, Pengfei; Zhao, Yang; Hao, Cong; Chen, Deming; Lin, Yingyan Celine (October 2024, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems)

Full Text Available
Exploring HW/SW Co-Design for Video Analysis on CPU-FPGA Heterogeneous Systems

https://doi.org/10.1109/TCAD.2021.3093398

Zhang, Xiaofan; Ma, Yuan; Xiong, Jinjun; Hwu, Wen-mei; Kindratenko, Volodymyr; Chen, Deming (June 2021, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems)
Full Text Available
AutoDNNchip: An Automated DNN Chip Predictor and Builder for Both FPGAs and ASICs

https://doi.org/10.1145/3373087.3375306

Xu, Pengfei; Zhao, Yang; Hao, Cong; Zhang, Xiaofan; Guan, Zetong; Zhang, Yongan; Wang, Yue; Chen, Deming; Lin, Yingyan (January 2020, 28th ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA 2020))

Full Text Available
Large-scale retrieval for medical image analytics: A comprehensive review

https://doi.org/10.1016/j.media.2017.09.007

Li, Zhongyu; Zhang, Xiaofan; Müller, Henning; Zhang, Shaoting (January 2018, Medical Image Analysis)

Full Text Available
